
Conversation

@randygrok (Contributor)

Overview

Closes: #2641

- Added Backup method to Store interface and DefaultStore implementation to stream a Badger backup of the datastore.
- Introduced BackupRequest and BackupResponse messages in the state_rpc.proto file to handle backup requests and responses.
- Implemented backup streaming logic in StoreServer, including metadata handling for current and target heights.
- Created a backupStreamWriter to manage chunked writing of backup data.
- Updated client tests to validate the Backup functionality.
- Enhanced mock store to support Backup method for testing.
- Added unit tests for Backup functionality in the store package.

github-actions bot commented Oct 14, 2025

The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).

Build: ✅ passed
Format: ⏩ skipped
Lint: ✅ passed
Breaking: ✅ passed
Updated (UTC): Nov 5, 2025, 8:45 PM


claude bot commented Oct 14, 2025

Claude finished @randygrok's task.


Comprehensive Code Review Complete

I've completed a detailed analysis of the backup/restore implementation across all focus areas. Here are my findings:


1. Code Quality & Architecture

✅ Strengths

  • Clean separation of concerns: Backup/restore logic properly separated into pkg/store (core logic), pkg/cmd (CLI), and pkg/rpc (streaming)
  • Idiomatic Go: Good use of interfaces, proper error wrapping, standard library patterns
  • Modular design: New functionality integrates well with existing store interface without breaking changes
  • Streaming architecture: Smart use of streaming RPC for large backup transfers

⚠️ Issues & Recommendations

pkg/store/types.go:75-85 - Interface design concern

type Backup interface {
    Backup(ctx context.Context, writer io.Writer, since uint64) (uint64, error)
    Restore(ctx context.Context, reader io.Reader) error
    Close() error  // ❌ Duplicated from Store interface
}
  • Close() appears in both Backup and Store interfaces. This is redundant since Store embeds Backup
  • Reviewer julienrbrt already noted this shouldn't be in the interface
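
A minimal sketch of the slimmed-down interface, with lifecycle management left to Store (assuming Store already declares its own Close()):

type Backup interface {
    // Backup streams a snapshot of all data newer than `since` to writer
    // and returns the version the snapshot covers.
    Backup(ctx context.Context, writer io.Writer, since uint64) (uint64, error)
    // Restore replaces the datastore contents from a backup stream.
    Restore(ctx context.Context, reader io.Reader) error
}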

pkg/cmd/backup.go:20-45 - Flag management

// Flags are defined inline in RunE
outputPath, err := cmd.Flags().GetString("output")
force, err := cmd.Flags().GetBool("force")
sinceVersion, err := cmd.Flags().GetUint64("since-version")
  • Recommendation: Define flags as variables at package level for better testability and clarity (as suggested by julienrbrt)
  • Pattern should be:
const (
    flagOutput = "output"
    flagForce = "force"
    flagSinceVersion = "since-version"
)
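
For illustration only, wiring the command to those constants might look like this (the help strings are assumptions, not the PR's actual text):

cmd.Flags().String(flagOutput, "", "path to write the backup file")
cmd.Flags().Bool(flagForce, false, "overwrite the output file if it already exists")
cmd.Flags().Uint64(flagSinceVersion, 0, "stream an incremental backup from this Badger version")

// In RunE, read the values back through the same constants:
outputPath, err := cmd.Flags().GetString(flagOutput)

Using one set of constants for both registration and lookup removes the chance of a typo splitting the two.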

pkg/cmd/backup.go:90-92 - Unnecessary nil check

if ctx == nil {
    ctx = context.Background()
}
  • Cobra commands always have a context via cmd.Context()
  • This check is unnecessary (julienrbrt's comment)

pkg/store/restore.go:22 - Type assertion justification

current, ok := any(s.db).(ds.Datastore)
  • The any() cast seems unnecessary: s.db is already ds.Batching, which embeds ds.Datastore
  • Question: Why is this needed? (julienrbrt's comment)
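
If the field really is typed as ds.Batching, a plain assignment should compile with no cast at all (sketch under that assumption):

// ds.Batching embeds ds.Datastore, so the conversion is implicit
// and can never fail; no assertion is needed.
var current ds.Datastore = s.db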

2. Error Handling & Resilience

✅ Strengths

  • Proper error wrapping with context using fmt.Errorf with %w
  • Cleanup on error: partial backup files are removed on failure (backup.go:100-104)
  • Stream error handling properly distinguishes context errors from internal errors

⚠️ Issues & Recommendations

pkg/cmd/backup.go:78-111 - Error cleanup can be improved

defer file.Close()  // ❌ Error ignored

// Later in error paths:
_ = writer.Flush()
_ = file.Close()
_ = os.Remove(outputPath)
  • Recommendation: Use defer for cleanup instead of repeating it in every error path (julienrbrt's suggestion); note this requires a named error return so the deferred closure sees the final err:
defer func() {
    _ = file.Close()
    if err != nil {
        _ = os.Remove(outputPath)
    }
}()

pkg/rpc/server/server.go:219-263 - Stream context cancellation

func (s *StoreServer) Backup(...) error {
    writer := newBackupStreamWriter(stream, defaultBackupChunkSize)
    version, err := s.store.Backup(ctx, writer, since)
    // What if ctx is canceled during backup?
}
  • Context cancellation during backup isn't explicitly tested
  • Recommendation: Add explicit context check before starting expensive operations
  • Badger's internal backup may handle this, but it's not obvious from the code
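
A cheap guard before the expensive call could look like this (sketch; the Connect error wrapping is an assumption about this codebase's conventions):

// Fail fast if the client already went away.
if err := ctx.Err(); err != nil {
    return connect.NewError(connect.CodeCanceled, err)
}
version, err := s.store.Backup(ctx, writer, since)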

pkg/store/backup.go:16-18 - Context validation

if err := ctx.Err(); err != nil {
    return 0, err
}
  • ✅ Good: Context checked at entry
  • ❌ Missing: Long-running backups don't check context during execution
  • Recommendation: Pass context through to Badger operations if supported
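
If the underlying Badger API takes no context, one common workaround is a writer wrapper that surfaces cancellation on every chunk (illustrative sketch, not the PR's code):

// ctxWriter aborts the stream as soon as its context is done.
type ctxWriter struct {
    ctx context.Context
    w   io.Writer
}

func (cw *ctxWriter) Write(p []byte) (int, error) {
    if err := cw.ctx.Err(); err != nil {
        return 0, err
    }
    return cw.w.Write(p)
}

Badger then fails the backup with the context error the next time it flushes a chunk.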

pkg/cmd/restore.go:86 - Resource leak risk

defer kvStore.Close()
// ...
defer file.Close()
  • If restore fails, store is closed even if partially written
  • Question: Should we validate the restore completed successfully before closing?
  • Recommendation: Consider keeping store open on error for debugging

3. Security

✅ Strengths

  • Backup files created with 0o600 permissions (backup.go:74) - good security practice
  • No credentials or sensitive data logged
  • RPC endpoints use standard Connect authentication

⚠️ Issues & Recommendations

pkg/cmd/backup.go:66-72 - File overwrite validation

if !force {
    if _, statErr := os.Stat(outputPath); statErr == nil {
        return fmt.Errorf("output file %s already exists (use --force to overwrite)", outputPath)
    }
}
  • ✅ Good: Prevents accidental overwrites
  • ⚠️ Race condition: TOCTOU (time-of-check to time-of-use) window between Stat and OpenFile
  • Recommendation: Use the os.O_EXCL flag instead when not forcing; the existence check then happens atomically inside the single open call:
flags := os.O_CREATE | os.O_WRONLY | os.O_TRUNC
if !force {
    flags |= os.O_EXCL
}
file, err := os.OpenFile(outputPath, flags, 0o600)

pkg/cmd/restore.go:54-65 - Directory deletion security

if force {
    if err := os.RemoveAll(dbPath); err != nil {
        return fmt.Errorf("failed to remove existing datastore: %w", err)
    }
}
  • ✅ Requires explicit --force flag
  • ⚠️ Risk: No validation that dbPath is within expected boundaries
  • Recommendation: Add path validation to prevent accidentally deleting system directories:
if force {
    // Validate dbPath is within rootDir. A bare strings.HasPrefix would
    // also match sibling paths such as rootDir+"-old", so compare via a
    // cleaned relative path instead.
    rel, err := filepath.Rel(nodeConfig.RootDir, dbPath)
    if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
        return fmt.Errorf("datastore path must be within root directory")
    }
    // ... proceed with removal
}

RPC backup endpoint - No authentication mentioned

  • Question: Are backup RPCs authenticated/authorized?
  • Recommendation: Document authentication requirements
  • Backups may contain sensitive application state

4. Performance & Resource Efficiency

✅ Strengths

  • Buffered I/O: 1 MiB buffer for both read and write operations
  • Chunked streaming: 128 KB chunks reduce memory pressure
  • No redundant copies: io.MultiWriter hands the same buffer to the file writer and the byte counter

⚠️ Issues & Recommendations

pkg/rpc/server/server.go:267-339 - Memory allocation pattern

func (w *backupStreamWriter) flush() error {
    chunk := make([]byte, len(w.buf))  // ❌ Allocates on every flush
    copy(chunk, w.buf)
    // ...
}
  • Issue: Creates new byte slice on every chunk flush
  • The copy is necessary because gRPC may retain the slice
  • Recommendation: Document why the copy is needed (gRPC retention)
  • Optimization: Consider using sync.Pool for chunk buffers if profiling shows allocation pressure
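
A pooled variant, if profiling ever warrants it, might look like the sketch below (field and message names are hypothetical; returning the buffer to the pool is only safe if Send does not retain it after returning):

var chunkPool = sync.Pool{
    New: func() any {
        b := make([]byte, defaultBackupChunkSize)
        return &b
    },
}

func (w *backupStreamWriter) flush() error {
    if len(w.buf) == 0 {
        return nil
    }
    // Assumes len(w.buf) <= defaultBackupChunkSize.
    bp := chunkPool.Get().(*[]byte)
    chunk := (*bp)[:len(w.buf)]
    copy(chunk, w.buf) // still copied out of w.buf, which is reused immediately
    err := w.stream.Send(&pb.BackupResponse{Chunk: chunk})
    chunkPool.Put(bp)
    w.buf = w.buf[:0]
    return err
}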

pkg/cmd/backup.go:80-82 - Buffering strategy

writer := bufio.NewWriterSize(file, 1<<20) // 1 MiB buffer
bytesCount := &countingWriter{}
streamWriter := io.MultiWriter(writer, bytesCount)
  • ✅ Good: Large buffer reduces syscalls
  • Consideration: 1 MiB buffer + 128 KB chunks + badger internal buffers
  • Total memory usage not obvious from code
  • Recommendation: Document expected memory usage for large backups
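
For reference, countingWriter is presumably just a byte tally feeding the size report; a minimal version (sketch, the PR's actual definition may differ):

// countingWriter counts bytes written so the command can report backup size.
type countingWriter struct {
    n int64
}

func (c *countingWriter) Write(p []byte) (int, error) {
    c.n += int64(len(p))
    return len(p), nil
}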

pkg/store/backup.go:21-35 - Datastore unwrapping

if badgerDatastore, ok := s.db.(*badger4.Datastore); ok {
    return backupBadger(badgerDatastore, writer, since)
}

if shim, ok := s.db.(ds.Shim); ok {
    children := shim.Children()
    if len(children) > 0 {
        if badgerDatastore, ok := children[0].(*badger4.Datastore); ok {
            return backupBadger(badgerDatastore, writer, since)
        }
    }
}
  • ✅ Handles both direct and wrapped datastores
  • ⚠️ Only unwraps one level - deeply nested wrappers not supported
  • Recommendation: Document limitation or add recursive unwrapping
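
Recursive unwrapping with a cycle guard could be as small as this (sketch; mirrors the visited map already used in restore.go):

func findBadger(d ds.Datastore, seen map[ds.Datastore]struct{}) (*badger4.Datastore, bool) {
    if _, dup := seen[d]; dup {
        return nil, false // cycle guard
    }
    seen[d] = struct{}{}
    if b, ok := d.(*badger4.Datastore); ok {
        return b, true
    }
    if shim, ok := d.(ds.Shim); ok {
        for _, child := range shim.Children() {
            if b, ok := findBadger(child, seen); ok {
                return b, true
            }
        }
    }
    return nil, false
}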

Restore goroutine safety - pkg/store/restore.go

  • Restore modifies datastore while it may be in use
  • Question: Is the node guaranteed to be stopped during restore?
  • Recommendation: Document that restore should only be used on stopped nodes

5. Testing & Reliability

✅ Strengths

  • Unit tests for backup command with mocked RPC server
  • Tests cover happy path and error cases
  • Integration with real Badger datastore tested (store_test.go:1197-1216)

❌ Critical Issues

Test coverage is only 17.66% (317 lines missing coverage):

  • pkg/rpc/server/server.go: 110 lines missing (Backup RPC handler)
  • pkg/cmd/restore.go: 84 lines missing (entire restore command)
  • pkg/store/restore.go: 37 lines missing (restore logic)

Missing test scenarios:

  1. Incremental backups (since-version parameter):
    • No tests verify incremental backup functionality
    • Critical for production usage
  2. Restore validation:
    • No tests for the restore command
    • No verification that restored data is valid
    • No tests for restore failure scenarios
  3. Concurrent access:
    • What happens if backup is triggered during block production?
    • What happens if multiple backups run simultaneously?
  4. Large dataset behavior:
    • No tests with large databases
    • Memory usage not validated
    • Stream timeout scenarios not tested
  5. Error recovery:
    • Network interruption during streaming backup
    • Partial file corruption
    • Disk-full scenarios

Recommendations:

// Add these test cases:

func TestBackupCmd_IncrementalBackup(t *testing.T) {
    // Test since-version parameter
    // Verify only new data is backed up
}

func TestRestoreCmd_Success(t *testing.T) {
    // Test full restore flow
    // Verify data integrity after restore
}

func TestRestoreCmd_InvalidBackup(t *testing.T) {
    // Test restore with corrupted backup file
}

func TestBackup_StreamInterruption(t *testing.T) {
    // Simulate network failure mid-stream
}

func TestBackup_ConcurrentAccess(t *testing.T) {
    // Backup while store is being written to
}

6. Documentation & Developer Experience

⚠️ Issues & Recommendations

Missing documentation:

  1. No README or guide for backup/restore operations:
    • How to perform backups?
    • How to test incremental backups?
    • What's the disaster recovery procedure?
  2. API documentation gaps:
    • Backup interface: When is Close() called?
    • Incremental backup: How does the since parameter work?
    • Restore: What happens to existing data?
  3. Operational concerns not documented:
    • Can backup run while the node is running?
    • How long do backups take?
    • What's the performance impact?
    • How to validate backup integrity?
  4. Example usage missing:

# Should document these workflows:

# Full backup
./ev-node backup --output=snapshot.badger

# Incremental backup
./ev-node backup --output=incremental.badger --since-version=12345

# Restore
./ev-node restore --input=snapshot.badger --force

# Remote backup
./ev-node backup --evnode.rpc.address=node.example.com:9090

Recommendations:

  • Add docs/backup-restore.md with operational guide
  • Add godoc comments explaining parameters
  • Add examples to command help text
  • Document performance characteristics

Summary of Critical Items

Must Fix Before Merge:

  1. Remove duplicate Close() from Backup interface (pkg/store/types.go:84)
  2. Add restore tests - 84 untested lines is unacceptable for data recovery code
  3. ⚠️ Fix TOCTOU race condition in file overwrite check (pkg/cmd/backup.go:66-72)
  4. ⚠️ Add path validation to prevent accidental directory deletion (pkg/cmd/restore.go:61-64)

Should Fix:

  1. Use defer for cleanup instead of repeating in error paths (pkg/cmd/backup.go)
  2. Define flags as constants for better code organization
  3. Remove unnecessary nil check (pkg/cmd/backup.go:90-92)
  4. Clarify any() type assertion (pkg/store/restore.go:22)

Nice to Have:

  1. Add incremental backup tests
  2. Document operational procedures
  3. Add concurrent access tests
  4. Consider sync.Pool optimization for chunks

Overall Assessment

This is a solid foundation for backup/restore functionality with good architecture and clean implementation. However:

  • Test coverage (17.66%) is insufficient for disaster recovery features
  • Security vulnerabilities (TOCTOU, path traversal) need addressing
  • Operational documentation is completely missing

Recommendation: Request changes to address critical security issues and add restore tests before merging.


codecov bot commented Oct 14, 2025

Codecov Report

❌ Patch coverage is 17.66234% with 317 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.55%. Comparing base (de40783) to head (7a31192).

Files with missing lines      Patch %   Lines
pkg/rpc/server/server.go      0.00%     110 Missing ⚠️
pkg/cmd/restore.go            0.00%     84 Missing ⚠️
pkg/cmd/backup.go             48.48%    37 Missing and 14 partials ⚠️
pkg/store/restore.go          0.00%     37 Missing ⚠️
pkg/rpc/client/client.go      35.29%    15 Missing and 7 partials ⚠️
pkg/store/backup.go           38.09%    11 Missing and 2 partials ⚠️
Additional details and impacted files
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2758      +/-   ##
==========================================
- Coverage   61.42%   59.55%   -1.88%     
==========================================
  Files          81       85       +4     
  Lines        8622     9007     +385     
==========================================
+ Hits         5296     5364      +68     
- Misses       2828     3122     +294     
- Partials      498      521      +23     
Flag        Coverage Δ
combined    59.55% <17.66%> (-1.88%) ⬇️

Flags with carried forward coverage won't be shown.

randygrok marked this pull request as ready for review October 15, 2025 09:10
julienrbrt (Member) left a comment


Small nits. Haven't checked the core logic yet.

randygrok requested a review from julienrbrt October 16, 2025 10:37
julienrbrt (Member) left a comment


nice! utACK. left some nits

pb "github.com/evstack/ev-node/types/pb/evnode/v1"
)

// NewBackupCmd creates a cobra command that streams a datastore backup via the RPC client.
julienrbrt (Member):

nit, let's define flags as variables

}

ctx := cmd.Context()
if ctx == nil {
julienrbrt (Member):

not necessary.

}, streamWriter)
if backupErr != nil {
// Remove the partial file on failure to avoid keeping corrupt snapshots.
_ = writer.Flush()
julienrbrt (Member):

use defer and remove them from all those error-handling paths

}

visited := make(map[ds.Datastore]struct{})
current, ok := any(s.db).(ds.Datastore)
julienrbrt (Member):

why the any cast?



Development

Successfully merging this pull request may close these issues.

[FEATURE] Copy live dbs

3 participants